Chapter 9 Structured Corpus

Many pre-collected corpora are available for linguistic studies. This chapter demonstrates how to load an existing corpus in R and perform basic corpus analysis on the data.

9.1 NCCU Spoken Mandarin

CHILDES format

9.1.3 Metadata vs. Transcript

9.1.4 Word Tokenization

9.1.5 Word Frequencies and Word Cloud

9.1.6 Concordances

9.1.7 N-grams (Lexical Bundles)

9.2 Connecting SPID to Metadata

Based on the metadata in each file header, we can extract demographic information about each speaker, including their ID, age, and gender.
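As a minimal sketch of this step, the base R code below parses the `@ID` header lines of a CHAT-style transcript and joins the resulting speaker table to the utterances by speaker ID. The transcript lines, the corpus name `toy`, and the object names (`chat_lines`, `speakers`, `utterances`, `merged`) are invented for illustration; the field order (language, corpus, speaker code, age, sex) follows the general CHAT convention for `@ID` lines and is assumed, not confirmed, to match the corpus files used in this chapter.

```r
# Toy CHAT-style transcript; all values are invented for illustration.
chat_lines <- c(
  "@Begin",
  "@Languages:\tzho",
  "@Participants:\tF1 Speaker, M1 Speaker",
  "@ID:\tzho|toy|F1|19;|female|||Speaker|||",
  "@ID:\tzho|toy|M1|34;|male|||Speaker|||",
  "*F1:\t你 好 .",
  "*M1:\t你 好 嗎 ?",
  "*F1:\t很 好 .",
  "@End"
)

# Keep only the @ID header lines and strip the "@ID:" prefix.
id_lines <- grep("^@ID:", chat_lines, value = TRUE)
id_body  <- sub("^@ID:[[:space:]]*", "", id_lines)

# Split each line on "|" (assumed order: language|corpus|code|age|sex|...).
fields <- strsplit(id_body, "|", fixed = TRUE)
pick   <- function(i) vapply(fields, function(x) x[i], character(1))

# One row per speaker; age keeps only the years before the ";".
speakers <- data.frame(
  SPID      = pick(3),
  age_years = as.numeric(sub(";.*$", "", pick(4))),
  gender    = pick(5),
  stringsAsFactors = FALSE
)

# Utterance lines start with "*SPID:"; split them into speaker and text.
utt_lines  <- grep("^\\*", chat_lines, value = TRUE)
utterances <- data.frame(
  SPID = sub("^\\*([^:]+):.*$", "\\1", utt_lines),
  text = sub("^\\*[^:]+:[[:space:]]*", "", utt_lines),
  stringsAsFactors = FALSE
)

# Attach each speaker's demographic information to their utterances.
merged <- merge(utterances, speakers, by = "SPID")
print(merged)
```

Once every utterance carries its speaker's age and gender, later analyses can group words or n-grams by those columns.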

9.3 More Sociolinguistic Analyses

9.3.1 Check N-gram Distribution by Age Group

Below20 Word Cloud

Order ggplot barplots by factor frequencies

9.3.2 Check Word Distribution by Gender